The System and the Study

To evaluate the impact of Omitted Variable Bias (OVB) on different models, consider a system where oceanography drives site temperature and recruitment over time. Temperature also fluctuates over time within each site. Recruitment and temperature influence snail abundance, as do other uncorrelated drivers. You have conducted a study in this system measuring 10 sites, each site sampled once per year over 10 years. In this study, you have recorded snail abundance and temperature. But have no measure of recruitment. Note, the results of models will be the same if you had instead sampled in one year across 10 sites with 10 plots per site if there were plot-level drivers of temperature that behaved the same as below.

To parameterize our simulations, consider the following:

  • Our Oceanography latent variable has a mean of 0 and a SD of 1.
  • Site temperature is twice the oceanography variable and transformed to have a mean of 15C.
  • Site recruitment is -2 multiplied by the oceanography variable and transformed to have a mean of 15 individuals per plot.
  • Sites are sampled over 10 years.
  • Within a site, the temperature varies over time according to a normal distribution with a mean of 1.
  • There is a 1:1 relationship between temperature and snail abundance and recruitment and snails.
  • Other non-correlated drivers in the system influence snail abundance with a mean influence of 0 and a SD of 1.

Thus, the system looks like this:

Functions to Create The System

To simulated data, let’s begin by loading some libraries

library(tidyverse)
library(lme4)
library(broom)
library(broom.mixed)
library(DiagrammeR)
library(glue)

theme_set(theme_bw(base_size = 14))

Next, we need a function that will create a template of simulated sites based on oceanography and the sampling design described above.

make_environment <- function(n_sites = 10,
                             ocean_temp = 2,
                             temp_sd = 0,
                             ocean_recruitment = -2,
                             recruitment_sd = 0,
                             temp_mean = 15,
                             rec_mean = 10,
                             seed = NULL){
  
  if(!is.null(seed)) set.seed(seed)
  
  tibble(
    site = as.character(1:n_sites)) %>%
    mutate(
      oceanography = rnorm(n_sites),
      site_temp = temp_mean + 
        rnorm(n_sites, ocean_temp * oceanography, temp_sd),
      site_recruitment = rec_mean +
        rnorm(n_sites, ocean_recruitment * oceanography, recruitment_sd)
      
    )
}

Great. Now, we need to add that year-to-year or plot-to-plot variability.

make_plots <- function(sites_df,
                       n_plots_per_site = 10,
                       plot_temp_sd = 1,
                       temp_effect = 1,
                       recruitment_effect = 1,
                       sd_plot = 1,
                       seed = NULL){
  
  if(!is.null(seed)) set.seed(seed)
  
  sites_df %>%
    rowwise() %>%
    mutate(
      plot_temp_dev_actual = list(rnorm(n_plots_per_site,
                                        0, plot_temp_sd))) %>%
    unnest(plot_temp_dev_actual) %>%
    mutate(
      plot_temp = site_temp + plot_temp_dev_actual,
      snails = rnorm(n(),
                     temp_effect*plot_temp +
                       recruitment_effect*site_recruitment,
                     sd_plot)) %>%
    ungroup() %>%
    group_by(site) %>%
    mutate(year = 1:n(),
           site_mean_temp = mean(plot_temp),
           plot_temp_dev = plot_temp - site_mean_temp,
           site_mean_snail = mean(snails),
           site_snail_dev = snails - mean(snails),
           delta_snails = snails - lag(snails),
           delta_temp = plot_temp - lag(plot_temp)) %>%
    ungroup()
  
}

To analyze our data, we will compare several different fit models.

  • A naive linear model with no site term
  • A random effects using site as a random intercept
  • A fixed effects model where site is a fixed effect (i.e., turned into 1/0 dummy variables)
  • A model where we include the site mean temperature as a covariate and site is a random effect
  • A model where we include site mean temperature as a covariate and site mean centered temperature (i.e., temperature at a site in a year minus it’s mean over the entire data set). Site is included as a random effect
  • A panel model where we look at change in snails between years versus change in temperature between years with a random site effect
analyze_plots <- function(plot_df){
  
  m <-  tribble(
    ~model_type, ~fit,
    "Naive", lm(snails ~ plot_temp, data = plot_df),
    "RE", lmer(snails ~ plot_temp + (1|site), data = plot_df),
    "FE", lm(snails ~ plot_temp + site, data = plot_df),
    "Group Mean Covariate", lmer(snails ~ plot_temp + site_mean_temp + (1|site), data = plot_df),
    "Group Mean Centered", lmer(snails ~ plot_temp_dev + site_mean_temp + (1|site), data = plot_df),
    "Group Mean Covariate, no RE", lm(snails ~ plot_temp + site_mean_temp, data = plot_df),
    "Group Mean Centered, no RE", lm(snails ~ plot_temp_dev + site_mean_temp, data = plot_df),
    "Panel", lm(delta_snails ~ delta_temp,data = plot_df)
    
  ) %>%
    mutate(coefs = map(fit, tidy), #get coefficients with broom
           out_stats = map(fit, glance),
           temp_effect = map(coefs, get_temp_coef),
           model_type = fct_inorder(model_type))
  
  m
}

get_temp_coef <- function(a_tidy_frame){
  a_tidy_frame %>%
    filter(term %in% c("plot_temp", "plot_temp_dev", "delta_temp")) %>%
    select(estimate, std.error)
}

Simulations and Results

Let’s begin by setting up 100 replicate simulations.

set.seed(31415)
n_sims <- 100

envt <- tibble(
  sims = 1:n_sims
) %>%
  mutate(sites = map(sims, ~make_environment())) 

Just for a sanity check, here’s the relationship between temperature and recruitment at the site level across all simulations.

ggplot(envt %>%
         unnest(sites),
       aes(x = site_temp, y = site_recruitment, group = sims)) +
  geom_point(alpha = 0.4, color = 'grey') +
  stat_smooth(method = "lm", size = 0.5, alpha = 0.5,
              fill = NA, color = "black") +
  labs(x = "Site Average Temperature", 
       y = "Site Recruitment of Individuals Per Plot")

Great! Now, let’s setup our sampling over time.

plots_df <- envt %>%
  mutate(site_year = map(sites, make_plots))

And, again, a sanity check…

ggplot(plots_df %>%
         unnest(site_year),
       aes(x = plot_temp, y = snails, color = site)) +
  geom_point(alpha = 0.5)+
  labs(x = "Plot Temperature",
       y = "Snails per Plot")

So we can see that in this setup, the snail-temperature relationship is positive. But, how positive is it? What would our coefficients show?

Let’s fit models to each set of data

analysis_df <- plots_df %>%
  mutate(analysis = map(site_year, analyze_plots)) %>%
  unnest(analysis) 

And now let’s look at the distribution of coefficients that would describe the relationship between temperature and snails from each mode.

analysis_df %>%
  unnest(temp_effect) %>%
  ggplot(aes(y = fct_rev(model_type), x = estimate)) +
  ggridges::stat_density_ridges() +
  labs(y="", x = "Model Estimated Temperature Effect")

Eyeballing it, we can see of course the naive model is too low. How bad is the bias for the RE model?

analysis_df %>%
  unnest(temp_effect) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Estimate` = mean(estimate),
            `SD Estimate` = sd(estimate)) %>%
  knit_table
Model Type Mean Estimate SD Estimate
Naive 0.232 0.096
RE 0.869 0.159
FE 0.996 0.109
Group Mean Covariate 0.996 0.109
Group Mean Centered 0.996 0.109
Group Mean Covariate, no RE 0.996 0.109
Group Mean Centered, no RE 0.996 0.109
Panel 0.989 0.130

The downward bias produced by poor model choice is clear. We can see it if we plot 1 minus the coefficient

analysis_df %>%
  unnest(temp_effect) %>%
  ggplot(aes(y = fct_rev(model_type), x = 1- estimate)) +
  ggridges::stat_density_ridges() +
  labs(y="", x = "Bias")

We can also look at the distribution of, for each simulated data set, how different is the RE model from each other model.

analysis_df %>%
  filter(model_type != "Naive") %>%
  unnest(temp_effect) %>%
  group_by(sims) %>%
  mutate(diff_from_re = estimate - estimate[1]) %>%
  ungroup() %>%
  filter(model_type != "RE") %>%
  ggplot(aes(y = fct_rev(model_type), x = diff_from_re)) +
  ggridges::stat_density_ridges() +
  labs(y="", x = "Model Estimated Temperature Effect")

Note, in all cases, we can see the effects of downward bias. This is clear, but, to put it in numbers -

analysis_df %>%
  filter(model_type != "Naive") %>%
  unnest(temp_effect) %>%
  group_by(sims) %>%
  mutate(diff_from_re = estimate - estimate[1]) %>%
  ungroup() %>%
  filter(model_type != "RE") %>%
  group_by(model_type) %>%
  summarize(`Mean Diff from RE` = mean(diff_from_re),
            `SD Diff from RE` = sd(diff_from_re))  %>%
  knit_table
model_type Mean Diff from RE SD Diff from RE
FE 0.128 0.097
Group Mean Covariate 0.128 0.097
Group Mean Centered 0.128 0.097
Group Mean Covariate, no RE 0.128 0.097
Group Mean Centered, no RE 0.128 0.097
Panel 0.120 0.112

If we were doing straight hypothesis testing, how often would our estimate of the temperature coefficient either overlap 0 or not have 1 within its confidence interval?

analysis_df %>%
  filter(model_type != "Naive") %>%
  unnest(temp_effect) %>%
  mutate(overlap_0 = (estimate - 2*std.error)<0,
         overlap_1 = (estimate - 2*std.error)<1 &
           (estimate + 2*std.error)>1) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`95% CI Contains 0` = sum(overlap_0)/n(),
            `95% CI does Not Contain 1` = (n()-sum(overlap_1))/n()) %>%
  knit_table
Model Type 95% CI Contains 0 95% CI does Not Contain 1
RE 0.01 0.28
FE 0.00 0.05
Group Mean Covariate 0.00 0.05
Group Mean Centered 0.00 0.05
Group Mean Covariate, no RE 0.00 0.03
Group Mean Centered, no RE 0.00 0.03
Panel 0.00 0.12

Here we see the RE model, while rarely, can be subject to type II error. Further, it is far more likely than any other technique to not have the true coefficient value within 2 CI of its estimand.

A Wrapper Function for Simulation

This has been useful, but, if we want to automate the process for further exploration, let’s wrap the code above into a function.

make_sims_and_analyze <- function(n_sims = 100,
                                  n_sites = 10,
                                  ocean_temp = 2,
                                  temp_sd = 0,
                                  ocean_recruitment = -2,
                                  recruitment_sd = 0,
                                  temp_mean = 15,
                                  rec_mean = 10,
                                  n_plots_per_site = 10,
                                  plot_temp_sd = 1,
                                  temp_effect = 1,
                                  recruitment_effect = 1,
                                  sd_plot = 1,
                                  seed = NULL) {
  #should we set a seed?
  if (!is.null(seed))
    set.seed(seed)
  
  # make an envt data frame
  out_df <- tibble(sims = 1:n_sims) %>%
    mutate(
      sites = map(
        sims,
        make_environment,
        n_sites = n_sites,
        ocean_temp = ocean_temp,
        temp_sd = temp_sd,
        ocean_recruitment = ocean_recruitment,
        recruitment_sd = recruitment_sd,
        temp_mean = temp_mean,
        rec_mean = rec_mean
      )
    ) %>%
    #now add plots
    mutate(
      site_year = map(
        sites,
        make_plots,
        n_plots_per_site = n_plots_per_site,
        plot_temp_sd = plot_temp_sd,
        temp_effect = temp_effect,
        recruitment_effect = recruitment_effect,
        sd_plot = sd_plot
      )
    ) %>%
    
    #and analysis
    mutate(analysis = map(site_year, analyze_plots)) %>%
    unnest(analysis)
  
}

Is that Random Effect Needed?

If we look at the Group Mean Centered and Group Mean Centered model, what’s the RE?

analysis_df %>%
  filter(model_type %in% c("Group Mean Covariate", "Group Mean Centered")) %>%
  unnest(coefs) %>%
  filter(group == "site") %>%
  group_by(`Model Type` = model_type) %>%
  summarize(Term = "Site Random Effect",
            `Mean Site SD` = mean(estimate),
            `SD in Site SD` = sd(estimate)) %>%
  knit_table
Model Type Term Mean Site SD SD in Site SD
Group Mean Covariate Site Random Effect 0.257 0.179
Group Mean Centered Site Random Effect 0.257 0.179

Both are the same - but both are not distinguishable from 0 (indeed, you’d see some warnings above if you changed the warnings option in this markdown).

Why is that?

In the system we have setup, there is no uncorrelated site-level variability. It’s all at the site-year (or site-plot) level. We have indeed run these models with no site random effect. They produce the same answers.

So should we just not worry about a site-level random effect? No. That’s because in the simulations above, we have not allowed for variation at the site level that is independent of oceanography. We can change this. In the simulation code above, we can add variation in recruitment or site mean temperature due to forces other than oceanography by tweaking their SD. If we do this with site temperature, this should merely aid in the estimation of our coefficient for the temperature effect, as it’s adding temperature variability that is uncorrelated with recruitment.

Instead, we can add non-oceanography-driven variation via recruitment. let’s assume that there is additional variation in average site recruitment that is uncorrelated with oceanography. This is one way site-level variability that is not correlated with oceanography. Now, using our new function that ties the room together, let’s create some new analyses where there is an additional SD of 1 in recruitment and look at the RE.

rec_var_df <- make_sims_and_analyze(recruitment_sd = 1)

You can now see that the temp-recruitment relationship has some wiggle.

ggplot(rec_var_df %>%
         group_by(sims) %>%
         slice(1L) %>%
         ungroup() %>%
         unnest(sites),
       aes(x = site_temp, y = site_recruitment, group = sims)) +
  geom_point(alpha = 0.4, color = 'grey') +
  stat_smooth(method = "lm", size = 0.5, alpha = 0.5,
              fill = NA, color = "black")+
  labs(x = "Site Average Temperature", 
       y = "Site Recruitment of Individuals Per Plot")

The models with omitted variable bias still outperform those without.

rec_var_df %>%
    unnest(temp_effect) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Estimate` = mean(estimate),
            `SD Estimate` = sd(estimate)) %>%
  knit_table
Model Type Mean Estimate SD Estimate
Naive 0.230 0.180
RE 0.903 0.120
FE 0.988 0.107
Group Mean Covariate 0.988 0.107
Group Mean Centered 0.988 0.107
Group Mean Covariate, no RE 0.988 0.107
Group Mean Centered, no RE 0.988 0.107
Panel 0.996 0.136

But the difference is not as pronounced due to the weakening of the correlation between temperature and recruitment. However, the site RE is now clearly not 0.

rec_var_df %>%
  filter(model_type %in% c("Group Mean Covariate", "Group Mean Centered")) %>%
  unnest(coefs) %>%
  filter(group == "site") %>%
  group_by(`Model Type` = model_type) %>%
  summarize(Term = "Site Random Effect",
            `Mean Site SD` = mean(estimate),
            `SD in Site SD` = sd(estimate)) %>%
  knit_table
Model Type Term Mean Site SD SD in Site SD
Group Mean Covariate Site Random Effect 0.988 0.299
Group Mean Centered Site Random Effect 0.988 0.299

This means that we can now look at residual variation due to between site differences as well as residual replicate-level variation. Further, while models without a site random effect can be fit and used, these models will be structurally incorrect - replicates are not IID, as there is correlated error within sites.

Practically, the difference comes in with respect to whether you are interested in between site variation or not. The residual inflates with no RE, as it is now the combination of between site residual and within site residual terms. This could make a difference for various statistical tests down the line as well, but in terms of parameter estimates, we are still estimating a clean causal effect.

rec_var_df %>%
  filter(grepl("Group Mean", model_type)) %>%
  unnest(out_stats) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Residual SD` = mean(sigma),
            `SD in Residual SD` = sd(sigma)) %>%
  ungroup() %>%
  knit_table
Model Type Mean Residual SD SD in Residual SD
Group Mean Covariate 0.993 0.083
Group Mean Centered 0.993 0.083
Group Mean Covariate, no RE 1.355 0.188
Group Mean Centered, no RE 1.355 0.188

What About Unbalanced Data?

One of the advantages to mixed modeling approaches is how they handle unbalanced data. What if, in the above example, we had lost samples from each site generating unbalanced data?

set.seed(31415)
#what do we keep?
years_to_keep <- round(runif(10, 3, 10)) %>%
    imap_dfr( ~ tibble(site = as.character(.y), 
                       year = sample(1:10, .x),
                       keep = T))

#a function for unbalancing
unbalance <- function(site_year_df, keep_df){
  left_join(site_year_df, keep_df) %>%
    filter(keep) 
}

#apply keeping to each site-year
unbalanced_df <- rec_var_df %>% 
  select(-model_type, -fit, -coefs, -out_stats, -temp_effect) %>%
  mutate(site_year = map(site_year, unbalance, 
                            keep_df = years_to_keep),
         analysis = map(site_year, analyze_plots)) %>%
    unnest(analysis)  

How does this lack of balance affect estimation of causal coefficients?

unbalanced_df %>%
    unnest(temp_effect) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Estimate` = mean(estimate),
            `SD Estimate` = sd(estimate)) %>%
  knit_table
Model Type Mean Estimate SD Estimate
Naive 0.239 0.198
RE 0.839 0.160
FE 0.994 0.130
Group Mean Covariate 1.000 0.130
Group Mean Centered 1.000 0.130
Group Mean Covariate, no RE 1.014 0.157
Group Mean Centered, no RE 1.014 0.157
Panel 0.997 0.157

We can see that point estimation here is improved somewhat. Although this could vary. More telling would be an exporation of those group means in the mixed versus fixed models.